You have been provided some customer data from a shopping centre. Do some exploratory analysis on the data. Comment on findings.
We are interested in creating a marketing campaign to target customers based on their spending score and annual income. Perform a k-means clustering to find if there are meaningful clusters in the data to target the customers.
Perform k-means clustering and chose a value of k.
Visualise the clustering for your chosen value of k.
Do you think the clustering seems a good fit for this data?
Comment on the attributes on one or two of the clusters (maybe even give them a label if you like - like in section 4.1 of the ‘Segmentation & clustering intro’ lesson).
library(tidyverse)
library(janitor)
library(broom)
library(animation)
customers <- read_csv("data/mall_customers.csv") %>%
clean_names()
print("NAs found in data?")
## [1] "NAs found in data?"
any(is.na(customers))
## [1] FALSE
Not part of the brief, but just having a quick look.
I’ve added a red line to show how this could perhaps be divided into two clusters - younger people tending to have a higher spending score
customers %>%
ggplot() +
aes(x = age, y = spending_score_1_100) +
geom_point()+
theme(panel.grid.major = element_blank(), panel.grid.minor = element_blank(),
panel.background = element_blank(), axis.line = element_line(colour = "black")) +
geom_segment(aes(x = 26, y = 00, xend = 45, yend = 100), colour = "red", linetype = "dashed") +
labs(title = "Spending score by age \n",
x = "\n Age",
y = "Spending score \n"
)
I’ve added red lines to highlight five clusters which are clear in the data.
These are consistent by gender.
As the groups are very distinct, I would tend to assume this is an educational dataset created specifically for this purpose.
Some notes on the clusters:
People earning <40k: two spending score clusters of 0-40 and 60-100
People earning 40-70k: one spending score cluster of 40-60
People earning >70k: two spending score clusters of 0-40 and 60-100
customers %>%
ggplot() +
aes(x = annual_income_k, y = spending_score_1_100) +
geom_point()+
scale_x_continuous(breaks = c(0, 20, 40, 60, 80, 100, 120, 140, 160)) +
scale_y_continuous(breaks = c(0, 20, 40, 60, 80, 100)) +
theme(panel.grid.major = element_blank(), panel.grid.minor = element_blank(),
panel.background = element_blank(), axis.line = element_line(colour = "black")) +
facet_wrap(~ gender) +
geom_segment(aes(x = 0, y = 40, xend = 110, yend = 40), colour = "red", linetype = "dashed") +
geom_segment(aes(x = 0, y = 60, xend = 110, yend = 60), colour = "red", linetype = "dashed") +
geom_segment(aes(x = 40, y = 0, xend = 40, yend = 100), colour = "red", linetype = "dashed") +
geom_segment(aes(x = 70, y = 0, xend = 70, yend = 100), colour = "red", linetype = "dashed") +
labs(title = "Spending score by annual income and gender \n",
x = "\n Annual income (000s)",
y = "Spending score \n"
)
customers %>%
summarise(mean_income = round(mean(annual_income_k)),
sd_income = sd(annual_income_k))
## # A tibble: 1 x 2
## mean_income sd_income
## <dbl> <dbl>
## 1 61 26.3
customers %>%
summarise(mean_spending_score = round(mean(spending_score_1_100)),
sd_spending_score = sd(spending_score_1_100))
## # A tibble: 1 x 2
## mean_spending_score sd_spending_score
## <dbl> <dbl>
## 1 50 25.8
income_spend_scaled <- customers %>%
select(annual_income_k, spending_score_1_100) %>%
mutate_if(is.numeric, scale)
head(income_spend_scaled)
## # A tibble: 6 x 2
## annual_income_k[,1] spending_score_1_100[,1]
## <dbl> <dbl>
## 1 -1.73 -0.434
## 2 -1.73 1.19
## 3 -1.70 -1.71
## 4 -1.70 1.04
## 5 -1.66 -0.395
## 6 -1.66 0.999
income_spend_scaled %>%
summarise(mean_income = round(mean(annual_income_k)),
sd_income = sd(annual_income_k))
## # A tibble: 1 x 2
## mean_income sd_income
## <dbl> <dbl>
## 1 0 1
income_spend_scaled %>%
summarise(mean_spending_score = round(mean(spending_score_1_100)),
sd_spending_score = sd(spending_score_1_100))
## # A tibble: 1 x 2
## mean_spending_score sd_spending_score
## <dbl> <dbl>
## 1 0 1
I’m creating five clusters, based on the plots I created in my exploratory analysis.
income_spend_clustered <- kmeans(income_spend_scaled,
centers = 5)
tidy(income_spend_clustered,
col.names = colnames(income_spend_scaled))
## # A tibble: 5 x 5
## annual_income_k spending_score_1_100 size withinss cluster
## <dbl> <dbl> <int> <dbl> <fct>
## 1 -1.33 1.13 22 5.22 1
## 2 -1.30 -1.13 23 7.58 2
## 3 0.989 1.24 39 19.7 3
## 4 1.05 -1.28 35 18.3 4
## 5 -0.200 -0.0264 81 14.5 5
I’m making this into a new table which I may use for deciding which customers to target
?? why is this showing different figures when I knit it
customer_with_cluster <-
augment(income_spend_clustered, customers) %>%
rename("cluster" = ".cluster") %>%
mutate(cluster_name = recode(cluster, "1" = "Mid income, mid spend",
"2" = "High income, low spend",
"3" = "Lower income, high spend",
"4" = "High income, high spend",
"5" = "Lower income, low spend",))
customer_with_cluster %>%
group_by(cluster, cluster_name) %>%
summarise(avg_income = mean(annual_income_k),
avg_spending_score = mean(spending_score_1_100)) %>%
arrange(avg_income)
## # A tibble: 5 x 4
## # Groups: cluster [5]
## cluster cluster_name avg_income avg_spending_score
## <fct> <fct> <dbl> <dbl>
## 1 1 Mid income, mid spend 25.7 79.4
## 2 2 High income, low spend 26.3 20.9
## 3 5 Lower income, low spend 55.3 49.5
## 4 3 Lower income, high spend 86.5 82.1
## 5 4 High income, high spend 88.2 17.1
income_spend_scaled %>%
kmeans.ani(centers = 5)
This is pretty clear for this dataset, but doing it for practice.
The elbow point of the graph is 5, which ties up with the intitial exploritary analysis of the data
glance(income_spend_clustered)
## # A tibble: 1 x 4
## totss tot.withinss betweenss iter
## <dbl> <dbl> <dbl> <int>
## 1 398. 65.2 333. 3
max_k <- 20
k_clusters <- tibble(k = 1:max_k) %>%
mutate(
kclust = map(k, ~ kmeans(income_spend_scaled, .x, nstart = 25)),
tidied = map(kclust, tidy),
glanced = map(kclust, glance),
augmented = map(kclust, augment, customers)
)
head(k_clusters,3)
## # A tibble: 3 x 5
## k kclust tidied glanced augmented
## <int> <list> <list> <list> <list>
## 1 1 <kmeans> <tibble [1 × 5]> <tibble [1 × 4]> <tibble [200 × 6]>
## 2 2 <kmeans> <tibble [2 × 5]> <tibble [1 × 4]> <tibble [200 × 6]>
## 3 3 <kmeans> <tibble [3 × 5]> <tibble [1 × 4]> <tibble [200 × 6]>
clusterings <- k_clusters %>%
unnest(glanced)
head(clusterings,3)
## # A tibble: 3 x 8
## k kclust tidied totss tot.withinss betweenss iter augmented
## <int> <list> <list> <dbl> <dbl> <dbl> <int> <list>
## 1 1 <kmeans> <tibble [1 … 398. 398. -2.27e-13 1 <tibble [200 ×…
## 2 2 <kmeans> <tibble [2 … 398. 268. 1.30e+ 2 1 <tibble [200 ×…
## 3 3 <kmeans> <tibble [3 … 398. 157. 2.41e+ 2 3 <tibble [200 ×…
ggplot(clusterings, aes(x=k, y=tot.withinss)) +
geom_point() +
geom_line() +
theme(panel.grid.major = element_blank(), panel.grid.minor = element_blank(),
panel.background = element_blank(), axis.line = element_line(colour = "black")) +
scale_x_continuous(breaks = seq(1, 20, by = 1)) +
geom_segment(aes(x = 5, y = 0, xend = 5, yend = 400), colour = "red", linetype = "dashed")
Clustering is a good fit for this dataset, which has clearly-defined groupings, already clearly visible in a simple scatter plot even before performing further analysis.
The clusters are summarised in the output below - points to note which may be useful when deciding which customers to target:
Income alone is not necessarily a good measure of which customers to target - there are high-spending clusters in both the low and high income brackets
The cluster with the highest number of customers is Mid income, mid spend; it may be worth considering this group even though it does not account for the highest spending scores (e.g. if reaching the both the low and high earners who have high spending scores would require two separate campaigns with associated costs)
customer_with_cluster %>%
group_by(cluster_name) %>%
summarise(number_of_customers = n(),
avg_income = mean(annual_income_k),
avg_spending_score = mean(spending_score_1_100)) %>%
arrange(avg_income)
## # A tibble: 5 x 4
## cluster_name number_of_customers avg_income avg_spending_score
## <fct> <int> <dbl> <dbl>
## 1 Mid income, mid spend 22 25.7 79.4
## 2 High income, low spend 23 26.3 20.9
## 3 Lower income, low spend 81 55.3 49.5
## 4 Lower income, high spend 39 86.5 82.1
## 5 High income, high spend 35 88.2 17.1
customer_with_cluster %>%
group_by(cluster_name) %>%
summarise(number_of_customers = n()) %>%
ggplot() +
aes(x = reorder(cluster_name, -number_of_customers), y = number_of_customers) +
theme(panel.grid.major = element_blank(), panel.grid.minor = element_blank(),
panel.background = element_blank(), axis.line = element_line(colour = "black")) +
theme(axis.text.x = element_text(angle=45,hjust=1)) +
geom_col(fill = "steel blue") +
labs(title = "Number of customers by cluster \n",
x = "\n Cluster name",
y = "Number of customers \n"
)